Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error after remote tidy timeout #4465

Merged
merged 6 commits into from
Oct 15, 2021
Merged

Conversation

MetRonnie
Copy link
Member

@MetRonnie MetRonnie commented Oct 12, 2021

This is a small change with no associated Issue.

Fix a ValueError that could occur during remote tidy during shutdown:

2021-10-11T11:38:10Z ERROR - Error during shutdown
2021-10-11T11:38:10Z ERROR - too many values to unpack (expected 2)
	Traceback (most recent call last):
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/scheduler.py", line 1613, in shutdown
	    await self._shutdown(reason)
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/scheduler.py", line 1708, in _shutdown
	    self.task_job_mgr.task_remote_mgr.remote_tidy()
	  File "/home/runner/work/cylc-flow/cylc-flow/cylc/flow/task_remote_mgr.py", line 347, in remote_tidy
	    for platform_n, (cmd, proc) in procs.items():
	ValueError: too many values to unpack (expected 2)

I think that might have been why _remote_background_indep_poll was so flaky on GH Actions.

In the process I refactored TaskRemoteMgr.remote_tidy() to use a queue of processes instead of a dict, much like how remote clean works.

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.py and conda-environment.yml.
  • Does not need tests.
  • Appropriate change log entry included.
  • No documentation update required.

@MetRonnie MetRonnie added the bug? Not sure if this is a bug or not label Oct 12, 2021
@MetRonnie MetRonnie added this to the cylc-8.0rc1 milestone Oct 12, 2021
@MetRonnie MetRonnie self-assigned this Oct 12, 2021
# Terminate any remaining commands
for platform_n, (cmd, proc) in procs.items():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the offending line

@MetRonnie MetRonnie modified the milestones: cylc-8.0rc1, cylc-8.0b3 Oct 12, 2021
@hjoliver
Copy link
Member

I think that might have been by _remote_background_indep_poll was so flaky on GH Actions.

Nice, kudos if that works out 🥇

cylc/flow/task_remote_mgr.py Show resolved Hide resolved
self.bad_hosts.add(host)
while queue and time() < timeout:
item = queue.popleft()
if item.proc.poll() is None: # proc still running
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use proc.wait(timeout=10) to avoid this polling loop and the subsequent termination loop e.g:

for proc in procs:
    try:
        if proc.wait(10):
            ...
        else:
            ...
    except subprocess.TimeoutExpired:
        ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would negate the concurrency in the way the queue is currently handled?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The processes would still be concurrent, however, the result retrieval would be sequential:

from subprocess import *
from time import time

procs = [
    Popen(['sleep', '2'])
    for _ in range(5)
]

start = time()

for proc in procs:
    proc.wait()

print(f'{time() - start}')
$ python test.py
2.007575035095215

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but considering that retries for SSH 255 failures are queued after result retrieval, it would still slow it down in those cases, I'd have thought

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, damn, adding to the list whilst processing it makes a mess of things.

@MetRonnie
Copy link
Member Author

Number of failed tests (before retry) now a lot less.

Before:

Test Summary Report
-------------------
tests/k/cylc-poll/16-execution-time-limit.t                                    (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
tests/f/cylc-clean/01-remote.t                                                 (Wstat: 0 Tests: 10 Failed: 1)
  Failed test:  2
tests/f/cylc-ping/04-check-keys-remote.t                                       (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  2, 5
tests/f/platforms/02-host-to-platform-upgrade.t                                (Wstat: 0 Tests: 6 Failed: 1)
  Failed test:  5
tests/f/job-submission/09-activity-log-host-bad-submit.t                       (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/job-submission/02-job-nn-remote-host.t                                 (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/intelligent-host-selection/03-polling.t                                (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/events/10-task-event-job-logs-retrieve.t                               (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/job-submission/17-remote-localtime.t                                   (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
tests/f/events/17-task-event-job-logs-retrieve-command.t                       (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  2
tests/f/database/03-remote.t                                                   (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  2
tests/f/events/33-task-event-job-logs-retrieve-3.t                             (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3-4
tests/f/job-submission/13-tidy-submits-of-prev-run-remote-host.t               (Wstat: 0 Tests: 11 Failed: 2)
  Failed tests:  2, 7
tests/f/cylc-cat-log/11-remote-retrieve.t                                      (Wstat: 0 Tests: 7 Failed: 1)
  Failed test:  2
tests/f/events/11-cycle-task-event-job-logs-retrieve.t                         (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3
tests/f/remote/06-poll.t                                                       (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/remote/00-basic.t                                                      (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  2
tests/f/restart/26-remote-kill.t                                               (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  2
tests/f/cylc-cat-log/10-remote-no-retrieve.t                                   (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  2
tests/f/authentication/01-remote-workflow-same-name.t                          (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  3
tests/f/cylc-cat-log/08-editor-remote.t                                        (Wstat: 0 Tests: 16 Failed: 1)
  Failed test:  2

After:

Test Summary Report
-------------------
tests/f/cylc-ping/04-check-keys-remote.t                                       (Wstat: 0 Tests: 5 Failed: 1)
  Failed test:  5
tests/f/events/33-task-event-job-logs-retrieve-3.t                             (Wstat: 0 Tests: 5 Failed: 2)
  Failed tests:  3-4
tests/f/events/17-task-event-job-logs-retrieve-command.t                       (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3
tests/f/remote/06-poll.t                                                       (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
tests/k/cylc-poll/16-execution-time-limit.t                                    (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  3-4
tests/f/events/10-task-event-job-logs-retrieve.t                               (Wstat: 0 Tests: 4 Failed: 3)
  Failed tests:  2-4
tests/f/events/11-cycle-task-event-job-logs-retrieve.t                         (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3

Copy link
Contributor

@datamel datamel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked out the branch, read the code. Only issue I have found is the purge has been commented out in the ihs test.

@oliver-sanders oliver-sanders added bug Something is wrong :( and removed bug? Not sure if this is a bug or not labels Oct 13, 2021
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks @MetRonnie

@hjoliver hjoliver merged commit 21ff967 into cylc:master Oct 15, 2021
@MetRonnie MetRonnie deleted the remote-tidy-fix branch October 15, 2021 15:26
@MetRonnie MetRonnie changed the title Attempt to address remote test flakiness Fix error after remote tidy timeout Oct 21, 2021
@MetRonnie MetRonnie linked an issue Nov 1, 2021 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

speed up _remote_background_indep_poll tests
4 participants